Based on Chapters 1 and 5 of the book Mathematical Engineering of Deep Learning by Benoit Liquet, Sarat Moka and Yoni Nazarathy
Data are available at https://github.com/benoit-liquet/MAEDL
A deep learning model is a very versatile function, \(f_\theta(x)\), where the input \(x\) is typically high dimensional and the parameter \(\theta\) is high dimensional as well.
The function returns an output \(y=f_\theta(x)\), where \(y\) is typically of significantly lower dimension than both \(\theta\) and \(x\).
Given large datasets comprised of many inputs \(x\) matched with desired outputs \(y\), it is often possible to find good parameter values \(\theta\) such that \(f_\theta(\cdot)\) approximately satisfies the desired relationship between \(x\) and \(y\).
The process of finding such \(\theta\) is called training. Once trained, the model \(f_\theta(\cdot)\) can be applied to unseen input data, \(x\), with a hope of making good predictions, classifications, or decisions.
Input data \(x\) can be in the form of an image, text, a sound waveform, tabular heterogeneous data, or some other variant.
Output data \(y\) can be the probability of a label indicating the meaning/content of the image, a numerical value, or an object of similar form to \(x\) such as translated text in case \(x\) is text, a masked image in case \(x\) is an image, or similar.
One very popular source of data is the ImageNet database which has been used for the development and benchmarking of many deep learning models.
The database has nearly 15 million color images. In many models a subset of about \(1.5\) million images are used for training where the basic form of \(y\) is given by a label which is one of \(1,000\) categories indicating the content of the image.
Consider the VGG19 model which is one of several popular (now classical) deep learning models for images. For this model, \(x\) is a \(224\times224\) color image. It is thus comprised of \(3\times224\times224 = 150,528\) values, with every coordinate of \(x\) representing the intensity of a specific pixel color (red, green, or blue).
The output \(y\) is a \(1,000\) dimensional vector where each coordinate of the vector corresponds to a different type of object, e.g. car, banana, etc. The numerical value of the coordinate \(y_i\), where say \(i\) is the index which matches banana is the model’s prediction of the probability that the input \(x\) is a picture of a banana.
In 2014 when VGG19 was introduced, it was fed about \(1.5\) million ImageNet images, \(x\), each with a corresponding label, e.g. banana, which is essentially a desired output \(y\). The process of training VGG19 then involved finding good parameters \(\theta\) so that when the model is presented with a new unseen image \(x\), it predicts the label of the image well.
Note that in VGG19, \(\theta\) has a huge number of parameters: \(144\) million!
There are \(1.5\times 10^6\) input data samples, each of size about \(1.5\times10^5\) (pixel values).
The training data thus has about \(2.25 \times 10^{11}\) values (numbers). These data were then used to learn about \(1.44\times10^8\) parameters.
At the time when this specific model was introduced it took days to train and much longer to fine tune. Today such a model may take around 8 hours to train.
Further, it can take about a fifth of a second to make a prediction with this model, that is, to evaluate \(f_\theta(x)\).
This training of VGG19 from scratch on ImageNet is not something one would typically do in practice.
We use the pre-trained VGG19 model parameters and adapt them based on another dataset called Fruits 360, which has nearly \(100,000\) images of fruits.
Here we use the fast.ai library with the Python language which also uses PyTorch under the hood.
However, we could have presented alternatives with other languages (e.g. Julia or R) as well as other deep learning libraries, such as for example Keras which uses TensorFlow.
Code here: https://colab.research.google.com/drive/1YOjnlAqY71PspLn0QzoYl5SmcEmXr4GP?usp=sharing
Semantic segmentation of images
Object detection and localization
Convolutional Neural Networks
Generative Adversarial Networks (GAN)
Encoder, Decoder, and Auto-encoder Models
Recurrent Neural Networks, LSTM, and Other Sequence Models
Deep Reinforcement Learning
Fully Connected Neural Networks
Transformer architectures
Diffusion models
Brain: estimated 85 billion neurons
A single human action such as the movement of an arm may induce the firing of around 50 million such neurons, whereas the identification of a visual object may use the bulk of the 150 million neurons that are in the visual cortex.
Deep neural network models are neither brains nor attempts to create artificial brains.
Nevertheless, the development of these models is highly motivated by the biological structure of the brain.
The basic building block of a deep neural network model is the (artificial) neuron abstracting the synapse connection between neurons via a single number called an activation value.
Pioneering and landmark work in AI research was inspired by neuroscience, since brains are essentially the only complete proof we have for the existence of what we call "general intelligence".
Many tasks of deep learning models involve the mimicking of human-level (or animal-level) tasks such as understanding images or conversational tasks.
One of the most well-known benchmarks in the world of artificial intelligence is the Turing test, originally named the imitation game when introduced by Alan Turing in 1950.
It is essentially a test to see if a computer can engage in a long conversation with a human, without another observing human being able to distinguish between the computer and the human.
Deep learning models generally only achieve narrow tasks such as pattern recognition, or playing specific games, as opposed to artificial general intelligence (AGI) tasks of creative problem solving.
Distinguish between seen data and unseen data.
Seen data is available for learning, namely for training of models, model selection, parameter tuning, and testing of models.
Unseen data is essentially unlimited. All data from the real world that is not available while learning takes place but is later available when the model is used. This can be data from the future, or data that was not collected or labelled with the seen data.
The sigmoid model is a logistic model.
It gives a probability as output:
\[\begin{equation} \label{eq:first-shallow-view} \hat{y}=\underbrace{\sigma\left(\overbrace{b+w^\top x}^{z}\right)}_{a}. \end{equation}\]
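As a minimal numerical sketch of this model (Python with NumPy; the weights, bias, and input below are made-up illustration values, not taken from any dataset):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    # Shallow sigmoid (logistic) model: y_hat = sigma(b + w^T x)
    return sigmoid(b + w @ x)

# Hypothetical example values
x = np.array([0.5, -1.2, 2.0])   # a 3-dimensional input
w = np.array([0.4, 0.3, -0.1])   # weights
b = 0.2                          # bias
y_hat = predict(x, w, b)         # a probability in (0, 1)
```

Whatever the input, the sigmoid squashes \(z = b + w^\top x\) into \((0,1)\), so the output can be read as a probability.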
Consider the data from Breast Cancer Wisconsin (Diagnostic) (WBCD) dataset
The goal is to predict whether a lump of a breast mass is benign (\(Y = 0\)) or malignant (\(Y = 1\)).
Historical data consists of a large number of patient records
30 (=\(d\)) characteristics of individual cells of breast cancer
Use a machine learning algorithm that observes these data and produces a predictor.
The predictor takes as input 30 values and returns a single Boolean prediction (0 or 1).
This is a classifier, since we are predicting an outcome that takes only two values
Evaluate model by its error rate on a separate test set of data, not used to develop the model
A probabilistic model returns a probability that the patient has a malignant lump, not just a Boolean
\[ \hat{y}=\underbrace{S_{\text{softmax}} \big(\overbrace{b+W x}^{z\in {\mathbb R}^K}\big)}_{a\in {\mathbb R}^K}, \qquad \text{with} \qquad S_{\textrm{Softmax}}(z) = \frac{1}{\sum_{i=1}^{K} e^{z_i}} \begin{bmatrix} e^{z_1} \\ \vdots\\ e^{z_{K}}\\ \end{bmatrix}. \]
The goal of a feedforward network is to approximate some function \(f^*: \mathbb{R}^p \longrightarrow \mathbb{R}^q\). A feedforward network model defines a mapping \(f_\theta: \mathbb{R}^p \longrightarrow \mathbb{R}^q\) and learns the value of the parameters \(\theta\) that ideally result in
\[ f^*(x) \approx f_\theta(x) \]
The function \(f_\theta\) is recursively composed via a chain of functions:
\[f_\theta(x)=f_{\theta[L]}^{[L]}\left(f_{\theta[L-1]}^{[L-1]}\left(\ldots\left(f_{\theta[1]}^{[1]}(x)\right) \ldots\right)\right)\]
\(f_{{\mathbb{\theta}}^{[\ell]}}^{(\ell)}: {\mathbb R}^{N_{\ell-1}} \longrightarrow {\mathbb R}^{N_\ell}\) for the \(\ell\)th layer associated parameters \({\mathbb{\theta}}^{[\ell]} \in \Theta^{[\ell]}\), where \(\Theta^{[\ell]}\) is the parameter space for the \(\ell\)th layer.
The depth of the network is \(L\).
\(N_0 = p\) (the number of features) and \(N_L = q\) (the number of output variables).
The number of neurons in the network is \(\sum_{\ell=1}^L N_\ell\).
The function \(f_{{\mathbb{\theta}}^{[\ell]}}^{(\ell)}\) is defined by an affine transformation followed by an activation function.
Activation functions are the means of introducing non-linearity into the model.
The output of layer \(\ell\) is represented by the vector \(a^{[\ell]}\)
The intermediate result of the affine transformation is represented by the vector \(z^{[\ell]}\)
We denote the output of the model via \(\hat{y}\) and hence, \[ \hat{y} = a^{[L]} = f_\theta(x). \]
The action of \(f_{{\mathbb{\theta}}^{[\ell]}}^{(\ell)}\) is represented as
\(a^{[0]} = x\). The parameters of the \(\ell\)th layer, \(\mathbb{\theta}^{[\ell]}\), are given by:
\(N_{\ell} \times N_{\ell-1}\) weight matrix \(W^{[\ell]} = \big(w^{[\ell]}_{i,j}\big)\)
\(N_\ell\) dimensional bias vector \(b^{[\ell]} = (b_i^{[\ell]})\).
Thus the parameter space of the layer is \(\Theta^{[\ell]} = \Re^{N_{\ell} \times N_{\ell-1}} \times \Re^{N_{\ell}}\).
The activation function \(S^{[\ell]}: \Re^{N_{\ell}} \longrightarrow \Re^{N_{\ell}}\) is a nonlinear vector-valued function. For \(\ell = 1,\ldots,L-1\) it is generally of the form \[\begin{equation} \label{eqn:activation-function-vector} S^{[\ell]}(z)=\left[\sigma^{[\ell]}\left(z_{1}\right) ~ \ldots ~\sigma^{[\ell]}\left(z_{N_{\ell}}\right)\right]^{\top}, \end{equation}\]
where \(\sigma^{[\ell]}: \Re \to \Re\) is typically an activation function common to all hidden layers. For the output layer, \(\ell = L\), it is often of a different form depending on the task at hand.
Output layer, \(\ell = L\)
\[\left\{ \begin{eqnarray*} \color{Green} {z_1^{[1]} } &=& \color{Orange} {w_1^{[1]}} ^T \color{Red}x + \color{Blue} {b_1^{[1]} } \hspace{2cm}\color{Purple} {a_1^{[1]}} = \sigma( \color{Green} {z_1^{[1]}} )\\ \color{Green} {z_2^{[1]} } &=& \color{Orange} {w_2^{[1]}} ^T \color{Red}x + \color{Blue} {b_2^{[1]} } \hspace{2cm} \color{Purple} {a_2^{[1]}} = \sigma( \color{Green} {z_2^{[1]}} )\\ \color{Green} {z_3^{[1]} } &=& \color{Orange} {w_3^{[1]}} ^T \color{Red}x + \color{Blue} {b_3^{[1]} } \hspace{2cm} \color{Purple} {a_3^{[1]}} = \sigma( \color{Green} {z_3^{[1]}} )\\ \color{Green} {z_4^{[1]} } &=& \color{Orange} {w_4^{[1]}} ^T \color{Red}x + \color{Blue} {b_4^{[1]} } \hspace{2cm} \color{Purple} {a_4^{[1]}} = \sigma( \color{Green} {z_4^{[1]}} ) \end{eqnarray*}\right.\]
where \(x=(x_1,x_2,x_3)^T\) and \(w_j^{[1]}=(w_{j,1}^{[1]},w_{j,2}^{[1]},w_{j,3}^{[1]},w_{j,4}^{[1]})^T\) (for \(j=1,\ldots,4\)).
Then, the output layer is defined by:
\[\begin{eqnarray*} \color{Green} {z_1^{[2]} } &=& \color{Orange} {w_1^{[2]}} ^T \color{purple}a^{[1]} + \color{Blue} {b_1^{[2]} } \hspace{2cm}\color{Purple} {a_1^{[2]}} = \sigma( \color{Green} {z_1^{[2]}} )\\ \end{eqnarray*}\]
where \(a^{[1]}=(a^{[1]}_1,\ldots,a^{[1]}_4)^\top\) and \(w_1^{[2]}=(w_{1,1}^{[2]},w_{1,2}^{[2]},w_{1,3}^{[2]},w_{1,4}^{[2]})^\top\)
One can use a matrix representation for efficient computation:
\[\begin{equation} \begin{bmatrix} \color{Orange}- & \color{Orange} {w_1^{[1]} }^T & \color{Orange}-\\ \color{Orange}- & \color{Orange} {w_2^{[1] } } ^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_3^{[1]} }^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_4^{[1]} }^T & \color{Orange}- \end{bmatrix} \begin{bmatrix} \color{Red}{x_1} \\ \color{Red}{x_2} \\ \color{Red}{x_3} \end{bmatrix} + \begin{bmatrix} \color{Blue} {b_1^{[1]} } \\ \color{Blue} {b_2^{[1]} } \\ \color{Blue} {b_3^{[1]} } \\ \color{Blue} {b_4^{[1]} } \end{bmatrix} = \begin{bmatrix} \color{Orange} {w_1^{[1]} }^T \color{Red}x + \color{Blue} {b_1^{[1]} } \\ \color{Orange} {w_2^{[1] } } ^T \color{Red}x +\color{Blue} {b_2^{[1]} } \\ \color{Orange} {w_3^{[1]} }^T \color{Red}x +\color{Blue} {b_3^{[1]} } \\ \color{Orange} {w_4^{[1]} }^T \color{Red}x + \color{Blue} {b_4^{[1]} } \end{bmatrix} = \begin{bmatrix} \color{Green} {z_1^{[1]} } \\ \color{Green} {z_2^{[1]} } \\ \color{Green} {z_3^{[1]} } \\ \color{Green} {z_4^{[1]} } \end{bmatrix} \end{equation}\]
and by defining
\[\color{Orange}{W^{[1]}} = \begin{bmatrix} \color{Orange}- & \color{Orange} {w_1^{[1]} }^T & \color{Orange}-\\ \color{Orange}- & \color{Orange} {w_2^{[1]} }^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_3^{[1]} }^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_4^{[1]} }^T & \color{Orange}- \end{bmatrix} \hspace{2cm} \color{Blue} {b^{[1]}} = \begin{bmatrix} \color{Blue} {b_1^{[1]} } \\ \color{Blue} {b_2^{[1]} } \\ \color{Blue} {b_3^{[1]} } \\ \color{Blue} {b_4^{[1]} } \end{bmatrix} \hspace{2cm} \color{Green} {z^{[1]} } = \begin{bmatrix} \color{Green} {z_1^{[1]} } \\ \color{Green} {z_2^{[1]} } \\ \color{Green} {z_3^{[1]} } \\ \color{Green} {z_4^{[1]} } \end{bmatrix} \hspace{2cm} \color{Purple} {a^{[1]} } = \begin{bmatrix} \color{Purple} {a_1^{[1]} } \\ \color{Purple} {a_2^{[1]} } \\ \color{Purple} {a_3^{[1]} } \\ \color{Purple} {a_4^{[1]} } \end{bmatrix}\]
we can write
\[\color{Green}{z^{[1]} } = W^{[1]} x + b^{[1]}\]
and then, by applying the element-wise activation function \(\sigma(\cdot)\) to the vector \(z^{[1]}\) (meaning that \(\sigma(\cdot)\) is applied independently to each element of \(z^{[1]}\)), we get:
\[\color{Purple}{a^{[1]}} = \sigma (\color{Green}{ z^{[1]} }).\]
The output layer can be computed in a similar way:
\[\color{YellowGreen}{z^{[2]} } = W^{[2]} a^{[1]} + b ^{[2]}\]
where
\[\color{Orange}{W^{[2]}} = \begin{bmatrix} \color{Orange} {w_{1,1}^{[2]} } & \color{Orange} {w_{1,2}^{[2]} } & \color{Orange} {w_{1,3}^{[2]} } & \color{Orange} {w_{1,4}^{[2]} } \end{bmatrix} \hspace{2cm} \color{Blue} {b^{[2]}} = \begin{bmatrix} \color{Blue} {b_1^{[2]} } \end{bmatrix} \]
and finally:
\[\color{Pink}{a^{[2]}} = \sigma ( \color{LimeGreen}{z^{[2]} })\longrightarrow \color{red}{\hat{y}}\]
Binary Classification task using R
Prediction task using R
Binary Classification task in Python (Google Colab)
Prediction task in Python (Google Colab)
We can generalize this simple neural network to a multi-layer fully-connected neural network by stacking more layers, obtaining a deeper fully-connected neural network defined by the Forward Pass equations.
\[\begin{equation} \label{eqn:forward-pass} \begin{array}{l} \text{Layer 1}~~ \left\{ \begin{array}{rrcl} \quad~~&{z^{[1]} } &=&W^{[1]}\overbrace{x}^{\text{Input}} +b^{[1]} \\ \quad~~&{a^{[1]} } &=&S^{[1]}(z^{[1]}) \\ \end{array} \right.\\[20pt] \text{Layer 2}~~ \left\{ \begin{array}{rrcl} \quad~~&{z^{[2]} } &=&W^{[2]}a^{[1]} +b^{[2]} \\ \quad~~&{a^{[2]} } &=&S^{[2]}(z^{[2]}) \\ \end{array} \right.\\[20pt] \quad \vdots\\[20pt] \text{Layer }L~~ \left\{ \begin{array}{rrcl} &{z^{[L]} } &=&W^{[L]}a^{[L-1]} +b^{[L]} \\ \underbrace{\hat{y}}_{\text{output}} ~= &{a^{[L]} } &=&S^{[L]}(z^{[L]}). \\ \end{array} \right. \end{array} \end{equation}\]
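The forward pass equations above can be sketched in a few lines of Python with NumPy; the layer sizes, random weights, and choice of tanh/identity activations below are arbitrary illustrations:

```python
import numpy as np

def forward(x, Ws, bs, S_hidden=np.tanh, S_out=lambda z: z):
    # Forward pass: z[l] = W[l] a[l-1] + b[l];  a[l] = S[l](z[l]),  a[0] = x
    a = x
    for ell, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = S_hidden(z) if ell < len(Ws) - 1 else S_out(z)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]   # N0 = p = 3, two hidden layers, NL = q = 1
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
y_hat = forward(np.array([1.0, -0.5, 0.3]), Ws, bs)
```

Note that only the last layer uses a different activation, mirroring the role of \(S^{[L]}\) in the equations.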
\[\begin{equation} \label{eqn:opened-out-example-1} f_\theta(x)=\underbrace{S^{[2]}(\overbrace{W^{[2]}\underbrace{S^{[1]}(\overbrace{W^{[1]}x+b^{[1] }}^{z^{[1]}} )}_{a^{[1]}}+b^{[2]}}^{z^{[2]}})}_{a^{[2]}}, \end{equation}\]
The deeper network below is represented by
\[\begin{equation} \label{eqn:opened-out-example-2} f_\theta(x)=S^{[4]}(W^{[4]}S^{[3]}(W^{[3]}S^{[2]}(W^{[2]}S^{[1]}(W^{[1]}x+b^{[1]})+b^{[2]}) +b^{[3]})+b^{[4]}) \end{equation}\]
The number of parameters is
\[ \underbrace{4\times4 + 4}_{\text{Hidden layer 1}} + \underbrace{3\times 4 + 3}_{\text{Hidden layer 2}}+ \underbrace{5\times 3 + 5}_{\text{Hidden layer 3}} + \underbrace{1 \times 5 +1}_{\text{Output layer}} = 61. \]
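This count is easy to verify programmatically; a small Python helper (the list below encodes the 4-4-3-5-1 network of this example, input size included):

```python
def n_params(sizes):
    """Number of parameters of a fully connected network with layer sizes
    [N0, N1, ..., NL]: layer l contributes an N_l x N_{l-1} weight matrix
    plus an N_l dimensional bias vector."""
    return sum(sizes[l] * sizes[l - 1] + sizes[l] for l in range(1, len(sizes)))

total = n_params([4, 4, 3, 5, 1])   # the network above -> 61
```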
The \(i\)th neuron of layer \(\ell\), with \(i=1,\ldots,N_\ell\), is typically composed of both \(z^{[\ell]}_i\) and \(a^{[\ell]}_i\). The transition from layer \(\ell-1\) to layer \(\ell\) takes the output of layer \(\ell-1\), an \(N_{\ell-1}\) dimensional vector, and operates on it as follows,
\[\begin{equation} \label{eq:scalar-nn-full} \substack{\text{Affine} \\ \text{Transformation}}:\left\{\begin{array}{lll} z_{1}^{[\ell]}&=&w_{(1)}^{[\ell]^{\top}} a^{[\ell-1]} +b_{1}^{[\ell]} \\ z_{2}^{[\ell]}&=&w_{(2)}^{[\ell]^{\top}} a^{[\ell-1]} +b_{2}^{[\ell]} \\ &\vdots&\\ z_{N_\ell}^{[\ell]}&=&w_{(N_\ell)}^{[\ell]^{\top}} a^{[\ell-1]} +b_{N_\ell}^{[\ell]} \end{array}\right. \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\substack{\text{Activation } \\ \text{Step}}:\left\{\begin{array}{lll} a_{1}^{[\ell]} &=&\sigma\left(z_{1}^{[\ell]}\right) \\ a_{2}^{[\ell]} &=&\sigma\left(z_{2}^{[\ell]}\right) \\ &\vdots&\\ a_{N_\ell}^{[\ell]} &=&\sigma\left(z_{N_\ell}^{[\ell]}\right) \end{array}\right., \end{equation}\] where, \[ {w_{(j)}^{[\ell]}}^\top = \left[w_{j, 1}^{[\ell]}~ \ldots ~ w_{j, N_{\ell-1}}^{[\ell]}\right], \qquad \text{for} \qquad j=1,\ldots,N_\ell, \] is the \(j\)th row of the weight matrix \(W^{[\ell]}\), and \(b^{[\ell]}_j\) is the \(j\)th element of the bias vector \(b^{[\ell]}\). Hence the parameters associated with neuron \(j\) in layer \(\ell\), are \({w_{(j)}^{[\ell]}}^\top\) and \(b^{[\ell]}_j\).
When counting layers in a neural network we count hidden layers as well as the output layer, but we don’t count an input layer.
It is a four-layer neural network with three hidden layers.
Let us first create a simple plot function for each activation function and its derivative.
library(ggplot2)
f <- function(x) {x}
plot_activation_function <- function(f, title, range){
ggplot(data.frame(x=range), mapping=aes(x=x)) +
geom_hline(yintercept=0, color='red', alpha=1/4) +
geom_vline(xintercept=0, color='red', alpha=1/4) +
stat_function(fun=f, colour = "dodgerblue3") +
ggtitle(title) +
scale_x_continuous(name='x') +
scale_y_continuous(name='') +
theme(plot.title = element_text(hjust = 0.5))
}

\[\sigma(z)=g(z)=\frac{1}{1+e^{-z}}\]
Its derivative:
\[\frac{d}{dz}\sigma(z)=\sigma(z)(1-\sigma(z))\]
f <- function(x){1 / (1 + exp(-x))}
df <- function(x){f(x)*(1-f(x))}
plotf <- plot_activation_function(f, 'Sigmoid', c(-4,4))
plotdf <- plot_activation_function(df, 'Derivative', c(-4,4))

ReLU, a more recent invention, stands for Rectified Linear Units.
\[ReLU(z)=\max{(0,z)}\]
Despite its name and appearance, it’s not linear and provides the same benefits as Sigmoid but with better performance.
Its derivative:
\[\frac{d}{dz}ReLU(z)= \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}\]
rec_lu_func <- function(x){ ifelse(x < 0 , 0, x )}
drec_lu_func <- function(x){ ifelse(x < 0 , 0, 1)}
plotf <- plot_activation_function(rec_lu_func, 'ReLU', c(-4,4))
plotdf <- plot_activation_function(drec_lu_func, 'Derivative', c(-4,4))Leaky Relu is a variant of ReLU. Instead of being 0 when \(z<0\), a leaky ReLU allows a small, non-zero, constant gradient \(\alpha\) (usually, \(\alpha=0.01\)). However, the consistency of the benefit across tasks is presently unclear.
\[LeaklyReLU(z)=\max{(\alpha z,𝑧)}\]
Its derivative:
\[\frac{d}{dz}LeakyReLU(z)= \begin{cases}\alpha & \text{if } z< 0 \\ 1 & \text{if } z\geq 0 \end{cases}\]
rec_lu_func <- function(x){ ifelse(x < 0 , 0.01*x, x )}
drec_lu_func <- function(x){ ifelse(x < 0 , 0.01, 1)}
plotf <- plot_activation_function(rec_lu_func, 'Leaky ReLU', c(-4,4))
plotdf <- plot_activation_function(drec_lu_func, 'Derivative', c(-4,4))

Tanh squashes a real-valued number to the range \([-1, 1]\). It is non-linear, but unlike the sigmoid its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
\[tanh(z) =\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\]
Its derivative:
\[\frac{d}{dz}tanh(z)=1-tanh(z)^2\]
tanh_func <- function(x){tanh(x)}
dtanh_func <- function(x){1-(tanh(x))**2}
plotf <- plot_activation_function(tanh_func, 'TanH', c(-4,4))
plotdf <- plot_activation_function(dtanh_func, 'Derivative', c(-4,4))-The most common example of this is the softmax activation function, typically used for classification in the last layer \(\ell = L\).
The \(\Re^{K} \to \Re^{K}\) softmax activation function is defined as, \[ S_{\text{softmax}}(z) = \frac{1}{\sum_{i=1}^{K} e^{z_i}} \begin{bmatrix} e^{z_1} & \cdots & e^{z_{K}} \end{bmatrix}^\top, \]
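A short Python/NumPy sketch of this softmax definition (the shift by \(\max(z)\) is a standard numerical-stability trick that leaves the result unchanged; it is not part of the formula above):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract max(z) before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))   # entries sum to 1
```

The output is a probability vector: non-negative entries summing to one, with larger \(z_i\) mapped to larger probabilities.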
Neural networks are known for being able to \(\color{Blue}{\textrm{approximate}}\) arbitrarily complex functions.
\(\color{Red}{\textrm{A General Approximation Result}}\):
Theorem. Consider a continuous function \(f^*: {\cal K} \to \Re^q\) where \({\cal K} \subset \Re^p\) is a compact set. Then for any non-polynomial activation function \(\sigma^{[1]}(\cdot)\) and any \(\varepsilon > 0\) there exist an \(N_1\) and parameters \(W^{[1]}\in\Re^{N_1\times p}\), \(b^{[1]} \in \Re^{N_1}\), and \(W^{[2]}\in \Re^{q\times N_1}\), such that the function \[ {f}_\theta(x)=W^{[2]}S^{[1]}(W^{[1]}x+b^{[1]}), \] satisfies \(||{f}_\theta(x)-f^*(x)||<\varepsilon\) for all \(x\in {\cal K}\).
Hence this theorem states that essentially all continuous functions can be approximated to arbitrary precision, dictated via \(\varepsilon\).
Practically, for complex functions \(f^*(\cdot)\) and small \(\varepsilon\), one may need a large \(N_1\).
Yet, the theorem states that it is always possible.
Let us try it in a short practice using R.
Let us try it in a short practice using Python (Google Colab).
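A complementary numerical illustration of the theorem (a sketch only: we fix random hidden-layer parameters \(W^{[1]}, b^{[1]}\) with ReLU activation and fit only the output weights \(W^{[2]}\) by least squares; the target \(f^*(x)=\sin(x)\) on \([-3,3]\), the width \(N_1=200\), and the seed are arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-3.0, 3.0, 400)       # grid over the compact set K = [-3, 3]
f_star = np.sin(xs)                     # target continuous function

N1 = 200                                # width of the single hidden layer
w1 = rng.normal(size=N1)                # random (frozen) hidden weights
b1 = rng.normal(size=N1)                # random (frozen) hidden biases
H = np.maximum(0.0, np.outer(xs, w1) + b1)   # hidden ReLU activations

# Fit only the output weights W^{[2]} by least squares
W2, *_ = np.linalg.lstsq(H, f_star, rcond=None)
err = np.max(np.abs(H @ W2 - f_star))   # uniform error over the grid
```

Even with random hidden parameters, a moderate width already drives the uniform error well below visual resolution, in line with the existence statement of the theorem.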
The expressive power of feedforward neural networks can be illustrated by considering specific stylized functions.
Multiplication of two inputs: a simple construction of a single hidden layer network with \(p=2\) and \(q=1\) allows us to create a function \(f_\theta(\cdot)\), parametrized by \(\lambda >0\), such that for input \(x=(x_1,x_2)\), the function approximately implements multiplication of inputs, \[\begin{equation} f_\theta(x_1,x_2) \approx x_1 x_2. \end{equation}\]
Importantly, the approximation error vanishes as \(\lambda \to 0\).
Requires only \(N_1 = 4\) neurons in the single hidden layer.
The activation function of the output layer is the identity.
There are no bias terms, and the weight matrices are,
\[ W^{[1]} = \begin{bmatrix} \lambda & \lambda \\ -\lambda & -\lambda \\ \lambda & -\lambda \\ -\lambda & \lambda \\ \end{bmatrix}, \qquad \text{and} \qquad W^{[2]}= \begin{bmatrix} \mu & \mu & -\mu & -\mu\\ \end{bmatrix}, \]
with \(\mu=\big({4\lambda^2\ddot{\sigma}(0)}\big)^{-1}\).
\(\ddot{\sigma}(0)\) represents the second derivative of the scalar activation function of the hidden layer (\(\ell = 1\)) at \(0\). Hence the model assumes \(\sigma^{[1]}(\cdot)\) is twice differentiable (at \(0\)) with a non-zero second derivative at zero.
It turns out that
\[ f_\theta(x_1, x_2) = \frac{\sigma\big(\lambda(x_1+x_2)\big)+\sigma\big(\lambda(-x_1-x_2)\big)-\sigma\big(\lambda(x_1-x_2)\big)-\sigma\big(\lambda(-x_1+x_2)\big)}{4 \lambda^2\ddot{\sigma}(0)}, \]
We may now use a Taylor expansion of \(\sigma(\cdot)\) around the origin,
\[\begin{equation} \label{eq:taylor-sigma} \sigma(u)=\sigma(0)+\dot{\sigma}(0) u + \ddot{\sigma}(0) \frac{u^{2}}{2}+{O}\left(u^{3}\right), \end{equation}\]
with \(O(h^k)\) denoting a function such that \(O(h^k)/h^k\) goes to a constant as \(h \to 0\).
\[ f_\theta(x_1, x_2) \equiv \frac{\sigma\big(\lambda(x_1+x_2)\big)+\sigma\big(\lambda(-x_1-x_2)\big)-\sigma\big(\lambda(x_1-x_2)\big)-\sigma\big(\lambda(-x_1+x_2)\big)}{4 \lambda^2\ddot{\sigma}(0)}=x_1 x_2\left(1+{O}\left(\lambda ( x_1^{2}+x_2^{2})\right)\right). \]
Hence as \(\lambda \to 0\) the desired goal becomes exact.
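The construction can be checked numerically. The Python/NumPy sketch below uses the softplus activation \(\sigma(u)=\log(1+e^u)\), one convenient choice with \(\ddot{\sigma}(0)=1/4 \neq 0\) (the book's construction allows any such activation); the test inputs are arbitrary:

```python
import numpy as np

def softplus(u):
    return np.log1p(np.exp(u))   # sigma(u), with sigma''(0) = 1/4

lam = 0.01      # small lambda -> better approximation
sig2 = 0.25     # second derivative of softplus at 0

def f_theta(x1, x2):
    # The 4-hidden-unit construction: weights +/- lambda, output scale mu
    num = (softplus(lam * (x1 + x2)) + softplus(lam * (-x1 - x2))
           - softplus(lam * (x1 - x2)) - softplus(lam * (-x1 + x2)))
    return num / (4 * lam**2 * sig2)

approx = f_theta(2.0, 3.0)   # should be close to 2 * 3 = 6
```

With \(\lambda = 0.01\) the approximation already agrees with the exact product to several decimal places, as the \(O(\lambda)\) error term predicts.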
Feature engineering: additional features by transforming existing features.
Example: consider \({p}\) features \({x}_1,\ldots,{x}_{{p}}\).
We wish to construct \(\tilde{p} = p(p+1)/2\) features based on all possible pairwise interactions \({x}_i {x}_j\) for \(1 \le i \le j \le p\).
For instance if \({p} = 1,000\) then we arrive at \(\tilde{p} \approx 500,000\).
Should we use a linear model acting on the transformed features \(\tilde{x}\), or a single hidden layer neural network model acting on the original features \(x\)?
For the linear model we have \(\tilde{f}_\theta(\tilde{x}) = \tilde{w}^\top \tilde{x}\), where the learned weight vector \(\tilde{w}\) has \(\tilde{p}\) parameters.
For a single hidden layer NN, there are \(p\) inputs, \(q=1\) output, and \(N_1\) units in the hidden layer. Thus the number of parameters is \(N_1 \times {p} + N_1 + N_1 + 1\).
Not all interactions (product features) are relevant.
Suppose only a fraction \(\alpha\) of the interactions are relevant.
The multiplication example requires 4 hidden units to approximate an interaction.
This example hints at the fact that with \(N_1 \approx 4 \, \alpha \, p\) hidden units we may be able to capture the key interactions.
\[ \underbrace{\frac{1}{2}p(p+1)}_{\text{Linear Model}} \qquad \text{vs.} \qquad \underbrace{4 \alpha p(p+2)+1}_{\text{Neural Network}}. \]
Observe that \(p^2\) is the dominant term in both models, but for \(\alpha < 1/8\) and large \(p\), the neural network model has fewer parameters.
With \(p=1,000\) if \(\alpha = 0.02\) (\(20\) significant interactions) then the linear model has an order of \(500,000\) parameters while the neural network only has order of \(80,000\) parameters.
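These counts can be reproduced with a few lines of Python (the functions below simply restate the two expressions above; integer truncation of \(N_1\) is an implementation detail):

```python
def linear_params(p):
    # Linear model on all pairwise interaction features: p(p+1)/2 weights
    return p * (p + 1) // 2

def nn_params(p, alpha):
    # Single hidden layer with N1 ~ 4*alpha*p units (4 units per relevant
    # interaction): N1*p hidden weights + N1 biases + N1 output weights + 1 bias
    N1 = int(4 * alpha * p)
    return N1 * p + N1 + N1 + 1

p, alpha = 1000, 0.02
lin, nn = linear_params(p), nn_params(p, alpha)   # 500500 vs 80161
```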
To gain high expressive power, this model might require a very large number of units (\(N_1\) needs to be very large). Hence gaining significant expressive power may require a very large number of parameters.
The power of deep learning then arises via repeated composition of non-linear activations functions via an increase of depth (an increase of \(L\)).
Note first that if the identity activation function is used in each hidden layer, then the network reduces to a shallow neural network, \[ f_\theta(x)=S^{[L]}(\tilde{W} x + \tilde{b}), \]
\[ \tilde{W} = W^{[L]}W^{[L-1]}\cdot \ldots \cdot W^{[1]}, \qquad \text{and} \qquad \tilde{b} = \sum_{\ell = 1}^L \Big(\prod_{\tilde{\ell} = \ell + 1}^L W^{[\tilde{\ell}]}\Big) b^{[\ell]}. \]
The model reduces to be a linear (affine) model. Thus, we have no gain by going deeper and adding multiple layers with identity activations.
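A quick numerical check of this collapse (Python/NumPy; the sizes and random parameters are arbitrary): a layer-by-layer pass with identity activations matches the single affine map \(\tilde{W}x + \tilde{b}\) built from the formulas above.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [3, 4, 4, 2]
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(3)]
bs = [rng.normal(size=sizes[l + 1]) for l in range(3)]

# Layer-by-layer pass with identity activations
x = np.array([0.3, -1.0, 2.0])
a = x
for W, b in zip(Ws, bs):
    a = W @ a + b

# Collapsed affine map: W_tilde = W3 W2 W1,  b_tilde = W3 W2 b1 + W3 b2 + b3
W_t = Ws[2] @ Ws[1] @ Ws[0]
b_t = Ws[2] @ Ws[1] @ bs[0] + Ws[2] @ bs[1] + bs[2]
```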
The expressivity of the neural network comes from the composition of non-linear activation functions.
The repeated compositions of such functions can reduce the number of units needed in each layer in comparison to a network with a single hidden layer. A consequence is that the parameter space is reduced as well.
We revisit the previous example involving models using interactions of order 2 (i.e. \(x_ix_j\)). Let us consider a higher complexity model by exploiting potential high-order interactions, namely products of \(r\) inputs.
Consider a fully connected feedforward network with \(p\) inputs, \(q=1\) output, and \(L \ge 2\) layers of the same size (\(N_\ell=N\) in each layer).
With \(N \approx 4 \, \alpha \, p\) hidden units we may be able to capture the \(\alpha p\) relevant interactions of order \(r=2\).
Then, moving forward in the network, the subsequent addition of a layer with \(N \approx 4 \, \alpha \, p\) hidden units will capture interactions of order \(r=2^2\), and so on, until we capture interactions of order \(r=2^{L}\) at the output layer.
Hence to achieve interactions of order \(r\) we may require \(L \approx \log_2 r\), or \(L = \lceil \log_2 r \rceil\).
The number of parameters is,
\[ \underbrace{N \times {p} +N}_{\text{First hidden layer}}+ \underbrace{(L-2)\times(N^2 + N)}_{\text{Inner hidden layers}} + \underbrace{N+1}_{\text{Output layer}} \approx L N^2. \]
For example, assume we wish to have a model for \(p=1,000\) features that supports about \(20\) meaningful interactions of order \(r=500\).
Hence we can consider \(\alpha = 0.02\). With a model not involving hidden layers, we cannot specialize for an order of \(20\) interactions and thus we require a full model with on the order of \(p^r/r! \approx 10^{365}\) parameters.
This deep construction, which can capture a desired set of meaningful interactions, is clearly more feasible and efficient than a shallow construction of astronomical size.
A key algorithm for learning parameters in deep learning models is the Back-Propagation algorithm, which computes the gradient of the loss function with respect to the parameters.
Back-Propagation implements backward mode automatic differentiation (see Chapter 4 of our book)
The key principle is the chain rule, which yields a recursive expression for the gradient flow.
The key elements are the intermediate derivative values
\[ \delta^{[\ell]} := \frac{\partial C(a^{[L]}, y \, ; \, \theta) }{\partial z^{[\ell]}}, \qquad \ell = 1,\ldots,L, \]
Using the chain-rule: \[\begin{equation} \label{eqn:delta-recursion-manit} \delta^{[\ell]} = \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial C}{\partial z^{[\ell+1]}} =\frac{\partial a^{[\ell]}}{\partial z^{[\ell]}}\frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \delta^{[\ell+1]}, \qquad \ell = L-1,\ldots,1, \end{equation}\]
For the final layer, \(\ell = L\), we have, \[\begin{equation} \label{eqn:last-delta} \delta^{[L]} = \frac{\partial a^{[L]}}{\partial z^{[L]}}\frac{\partial C}{\partial a^{[L]}}. \end{equation}\]
The two relations above are the key back-propagation recursions.
From a practical perspective, these steps are sometimes subject to instability when the number of layers \(L\) is large.
To illustrate, consider a network with identity activations, no biases, and identical hidden weight matrices \(W^{[\ell]} = W\) for \(\ell = 1,\ldots,L-1\). Then, \[\begin{equation} \label{eq:repated-matrix-stuff} \hat{y}=a^{[L]}=W^{[L]} W^{[L-1]} W^{[L-2]}\cdot \ldots \cdot W^{[3]} W^{[2]} W^{[1]} x =W^{[L]} W^{L-1} x, \end{equation}\] where \(W^{L-1}\) is the \((L-1)\)th power of \(W\).
Unless the maximal eigenvalue of \(W\) is of exactly unit magnitude, as \(L\) grows we have that \(\hat{y}\) either vanishes (towards \(0\)) or explodes (with values of increasing magnitude).
If \(W = wI\) (a constant multiple of the identity matrix), then \(\hat{y}=W^{[L]} w^{L-1} x\), and for any \(|w| \neq 1\), the vanishing or exploding \(\hat{y}\) phenomenon persists.
This illustration shows that for non-small network depths (large \(L\)), instability issues may arise in the forward pass.
The same type of instability problem can also persist in the backward pass, since the backward recursion \(\delta^{[\ell]} =\textrm{Diag}\big(\dot{\sigma}^{[\ell]}(z^{[\ell]})\big) {W^{[\ell+1]}}^\top \delta^{[\ell+1]}\) also involves repeated matrix multiplications. If for simplicity we ignore the activation functions and again take a constant matrix \(W\), then,
\[\begin{equation} \label{eq:back-matrix-power-problem} \delta^{[\ell]} = \Big(W^\top\Big)^{L-\ell} \delta^{[L]}. \end{equation}\]
There is often a vanishing or exploding nature of \(\delta^{[\ell]}\) for large \(L\) and low values of \(\ell\) (the first layers of the network).
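This backward instability is easy to reproduce numerically. The Python/NumPy sketch below follows the simplified recursion above (activations ignored, one constant matrix \(W\)), rescaling \(W\) so its spectral radius is below or above one; the sizes, depth, and scales are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 20, 50
A = rng.normal(size=(N, N))

# Rescale A so its largest eigenvalue magnitude (spectral radius) is 0.5 or 1.5
radius = np.max(np.abs(np.linalg.eigvals(A)))
W_small = 0.5 * A / radius   # radius 0.5 -> vanishing deltas
W_big   = 1.5 * A / radius   # radius 1.5 -> exploding deltas

def delta_norms(W, steps):
    delta = np.ones(N)       # stand-in for delta^{[L]}
    out = []
    for _ in range(steps):
        delta = W.T @ delta  # delta^{[l]} = W^T delta^{[l+1]}, activations ignored
        out.append(np.linalg.norm(delta))
    return out

vanish = delta_norms(W_small, L)
explode = delta_norms(W_big, L)
```

After 50 backward steps the first case has shrunk by many orders of magnitude while the second has blown up, exactly the vanishing/exploding behaviour described above.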
The gradient values \(g_W^{[\ell]}\) and \(g_b^{[\ell]}\) may get smaller and smaller (vanishing) or larger and larger (exploding) as we go backward with every layer during back propagation.
In the worst case, vanishing gradients may completely stop the neural network from training, while exploding gradients may throw parameter values in arbitrary directions.
This may result in oscillations around the minima, or in overshooting the optimum again and again.
Another impact of exploding gradients is that huge gradient values may cause number overflow, resulting in incorrect computations or the introduction of NaN ("not a number") values.
Solution:
Gradient descent improvements such as RMSProp and Adam can help normalize such variation in the gradients. Nevertheless, numerical instability can still persist.
Further, with activation functions such as sigmoid or tanh, in cases of inputs far from \(0\) the gradient components of \(\textrm{Diag}\big(\dot{\sigma}^{[\ell]}(z^{[\ell]})\big)\) may also vanish.
Activation functions such as ReLU or Leaky ReLU handle such problems, yet the overarching phenomenon still persists.
One strategy for mitigating such a problem is based on weight initialization.
Starting with initial values that are either constant or \(0\) for the weights and bias parameters may throw the learning process off.
Such constant initial parameters may impose symmetry on the activation values of the hidden units and in turn prohibit the model from exploiting its expressive power.
Random initialization enables us to break any potential symmetries and is almost always preferable.
General practice: the most basic random initialization approach is to set all parameters of the weight matrices \(W^{[1]},\ldots,W^{[L]}\) as independent and identically distributed standard normal random variables and to set all the entries of the bias vectors \(b^{[1]},\ldots,b^{[L]}\) at \(0\).
A nice animated post on the influence of weight initialization can be found here: Initializing neural networks.
For deep networks, heuristics that initialize the weights based on the non-linear activation function are generally used. The most common practice is to draw the elements of the matrix \(W^{[\ell]}\) from a normal distribution with variance \(k/m_{\ell-1}\), where \(m_{\ell-1}\) is the number of units in layer \(\ell-1\) and \(k\) depends on the activation function.
While these heuristics do not completely solve the exploding/vanishing gradients issue, they help mitigate it to a great extent.
for ReLU activation: \(k=2\) (known as He initialization);
for tanh activation: \(k=1\). This heuristic is called Xavier initialization.
Another commonly used heuristic is to draw from a normal distribution with variance \(2/(m_{\ell-1}+m_\ell)\).
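These variance-scaling heuristics can be sketched in a few lines. The function below is illustrative (the layer sizes and function name are our own, not from the book), selecting the variance according to the activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the variance-scaling initialization heuristics above.
# m_prev is the fan-in m_{l-1}; m is the fan-out m_l.
def init_weights(m_prev, m, activation="relu"):
    if activation == "relu":       # He initialization: k = 2
        var = 2.0 / m_prev
    elif activation == "tanh":     # Xavier initialization: k = 1
        var = 1.0 / m_prev
    else:                          # alternative: 2 / (fan_in + fan_out)
        var = 2.0 / (m_prev + m)
    return rng.normal(0.0, np.sqrt(var), size=(m, m_prev))

W = init_weights(512, 256, "relu")
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```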
The idea of batch normalization is to normalize (or standardize) not just the input data but also individual neuron values within the intermediate hidden layers or final layer of the network.
Taking \(j\) as an index of a neuron in layer \(\ell\), we may wish to have either \(z_j^{[\ell]}\) or \(a_j^{[\ell]}\) exhibit near-normalized values over the input dataset.
Such normalization of the neuron values then yields more consistent training and mitigates vanishing or exploding gradient problems. It also has a slight regularization effect which may prevent overfitting.
Here we outline normalization of the \(z_j^{[\ell]}\) values, but one may choose to normalize the \(a_j^{[\ell]}\) values instead.
As a reference point, recall standardization of the input data: subtraction of the mean of each feature and division by the standard deviation of that feature, using the sample mean and sample standard deviation,
\[\begin{equation} \label{eq:stats-mean-var} \overline{x}_i = \frac{1}{n} \sum_{j=1}^n x_i^{(j)}, \qquad s^2_i = \frac{1}{n} \sum_{j=1}^n (x_i^{(j)} - \overline{x}_i)^2. \end{equation}\]
\[\begin{equation} \label{eq:ref-stand-z} z^{(j)}_i = \frac{x^{(j)}_i - \overline{x}_i}{s_i} \qquad \text{for} \qquad j=1,\ldots,n. \end{equation}\]
For feature \(i\), \(z_i^{(1)}, \ldots,z_i^{(n)}\), has a sample mean of exactly \(0\) and a sample standard deviation of exactly \(1\).
Such standardization is useful as it places the dynamic range of the model inputs on a uniform scale and thus improves the numerical stability of algorithms.
An alternative is min-max scaling,
\[ z^{(j)}_i = \frac{x^{(j)}_i - x_i^{\text{min}}}{x_i^{\text{max}}-x_i^{\text{min}}} \qquad \text{for} \qquad j=1,\ldots,n, \]
where \(x_i^{\text{min}}\) and \(x_i^{\text{max}}\) are the minimum and maximum values of the \(i\)-th feature.
Scale Invariance: Preserves the shape of the original distribution.
Convergence: Helps in faster convergence during training.
Stability: Reduces the chance of vanishing or exploding gradients.
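Both scaling schemes above are one-liners column-wise on a data matrix. A minimal sketch with an illustrative matrix `X` (rows = samples, columns = features):

```python
import numpy as np

# Illustrative data matrix: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: subtract the sample mean, divide by the sample std
# (numpy's default ddof=0 matches the 1/n definitions in the text).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each feature to [0, 1].
Z_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(Z.mean(axis=0))   # ≈ [0, 0]
print(Z.std(axis=0))    # ≈ [1, 1]
print(Z_mm)             # [[0, 0], [0.5, 0.5], [1, 1]]
```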
The main idea of batch normalization is to consider neuron \(j\) in layer \(\ell\) and instead of using \(z^{[\ell]}_j\) to use a transformed version \(\tilde{z}_j^{[\ell]}\).
Such a transformation takes place both at training time and when using the model in production.
The transformation aims to position the \(\tilde{z}_j^{[\ell]}\) values so that they have approximately zero mean and unit standard deviation over the data.
Further, the transformation involves a correction using trainable parameters.
During training time, at a given training epoch and for a given mini-batch of size \(n_b\):
\[\begin{equation} \label{eq-bm-mean-std} \hat{\mu}_j^{[\ell]} =\frac{1}{n_b}\sum_{i=1}^{n_b}z_j^{[\ell](i)} \qquad \text{and} \qquad \hat{\sigma}_j^{[\ell]} = \sqrt{ \frac{1}{n_b}\sum_{i=1}^{n_b}(z_j^{[\ell](i)}-\hat{\mu}_j^{[\ell]})^2}, \end{equation}\]
where \(z_j^{[\ell](i)}\) is the value at unit \(j\), at layer \(\ell\), and sample \(i\) within the mini-batch, prior to carrying out normalization.
\[\begin{equation} \label{eq:bnorm1} \bar{z}_j^{[\ell](i)}=\frac{z_j^{[\ell](i)}-\hat{\mu}_j^{[\ell]}}{\sqrt{(\hat{\sigma}_j^{[\ell]})^2+\varepsilon}}, \end{equation}\]
where \(\varepsilon > 0\) is a small constant added for numerical stability.
At this point \(\bar{z}_j^{[\ell](i)}\) has nearly zero mean and nearly unit standard deviation for all data samples \(i\) in the mini-batch.
An additional transformation takes place in the form,
\[\begin{equation} \label{eq:bnorm2} \tilde{z}_j^{[\ell](i)}=\gamma_j^{[\ell]} \bar{z}_j^{[\ell](i)} + \beta_j^{[\ell]}, \end{equation}\]
where \(\gamma_j^{[\ell]}\) and \(\beta_j^{[\ell]}\) are trainable parameters.
\(\tilde{z}_j^{[\ell](i)}\) has a standard deviation of approximately \(\gamma_j^{[\ell]}\) and a mean of approximately \(\beta_j^{[\ell]}\) over the data samples \(i\) in the mini-batch.
These parameters are respectively initialized at \(1\) and \(0\), and then as training progresses, \(\gamma_j^{[\ell]}\) and \(\beta_j^{[\ell]}\) are updated using the same learning mechanisms applied to the weights and biases of the network.
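The two-step transformation \eqref{eq:bnorm1}--\eqref{eq:bnorm2} over a mini-batch can be sketched as follows. This is a minimal illustration (the function name and mini-batch are our own), with \(\gamma_j^{[\ell]}\) and \(\beta_j^{[\ell]}\) at their initial values \(1\) and \(0\):

```python
import numpy as np

# Sketch of the batch-normalization transform for one layer's
# pre-activations z (shape: n_b samples x number of units).
def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                    # per-unit mini-batch mean
    var = z.var(axis=0)                    # per-unit mini-batch variance
    z_bar = (z - mu) / np.sqrt(var + eps)  # near zero mean, unit std
    return gamma * z_bar + beta            # trainable scale and shift

rng = np.random.default_rng(1)
z = rng.normal(5.0, 3.0, size=(32, 4))     # illustrative mini-batch, n_b = 32
z_tilde = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))

print(z_tilde.mean(axis=0))  # ≈ 0 for each unit
print(z_tilde.std(axis=0))   # ≈ 1 for each unit
```

With trained values of \(\gamma_j^{[\ell]}\) and \(\beta_j^{[\ell]}\), the output instead has (approximately) standard deviation \(\gamma_j^{[\ell]}\) and mean \(\beta_j^{[\ell]}\).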
Dropout
Addition of Regularization Terms and Weight Decay
Dropout is a popular and efficient regularization technique in which units (hidden and visible) of the network are randomly dropped during training. Dropping a unit out means temporarily removing it from the network, along with all its incoming and outgoing connections; the choice of which units to drop is random.
At any back propagation iteration (forward pass and backward pass) on a mini-batch, only some random subset of the neurons is active. Practically neurons in layer \(\ell\), for \(\ell=0,\ldots,L-1\), have a specified probability \(p_{\text{keep}}^{[\ell]} \in (0,1]\) where if \(p_{\text{keep}}^{[\ell]} = 1\) dropout does not affect the layer, and otherwise each neuron \(i\) of the layer is ``dropped out’’ with probability \(1 - p_{\text{keep}}^{[\ell]}\).
In the backward pass: when neuron \(i\) is dropped out in layer \(\ell\), the weights \(w_{i,j}^{[\ell+1]}\) for all neurons \(j=1,\ldots,N_{\ell+1}\) are updated based on the gradient \([g_W^{[\ell+1]}]_{ij}\), which is set to \(0\).
With a pure gradient descent optimizer this means that weights \(w_{i,j}^{[\ell+1]}\) are not updated at all during the given iteration, whereas with a momentum based optimizer such as ADAM it means that the descent step for those weights has a smaller magnitude.
In practice, this simple and easy idea of dropout has improved performance of deep neural networks in many empirically tested cases. It is now an integral part of deep learning training.
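The mechanics of a dropout layer reduce to sampling a random binary mask. Below is a minimal sketch of the common "inverted dropout" variant (the \(1/p_{\text{keep}}\) rescaling is standard practice, though not spelled out above; it keeps the expected activation unchanged):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of (inverted) dropout on one layer's activations during training:
# each unit is kept with probability p_keep, and surviving activations
# are rescaled by 1/p_keep so the expected value is unchanged.
def dropout(a, p_keep):
    mask = rng.random(a.shape) < p_keep   # True with prob p_keep
    return a * mask / p_keep

a = np.ones((4, 10))                       # illustrative activations
a_drop = dropout(a, p_keep=0.8)

# Roughly 20% of entries are zeroed; the rest are scaled to 1/0.8 = 1.25.
print((a_drop == 0).mean())
```

At production time no units are dropped, and with this rescaling no further correction is needed.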
Dropout can be viewed as an approximation of an \(\color{Red}{\textrm{ensemble method}}\), a general concept from machine learning.
What is \(\color{Red}{\textrm{ensemble learning}}\) ?
When we seek a model \(\hat{y} = f_\theta(x)\), we may use the same dataset to train multiple models that all try to achieve the same task.
We may then combine the models into an \(\color{Blue}{\textrm{ensemble}}\) (model).
The ensemble is usually \(\color{Red}{\textrm{more accurate}}\) than each of the individual models.
Consider a scalar output model.
We use \(\color{Blue}{M\ \textrm{models}}\): \(\hat{y}^{\{ i \} } = f_{\theta_{\{i\}}}^{\{i\}}(x)\) for \(i=1,\ldots,M\), where \(\theta_{\{i\}}\) is taken here as the set of parameters of the \(i\)-th model.
The ensemble model on an input \(x\) is then \(\color{Red}{\textrm{the average}}\), \[ f_{\theta}(x) = \frac{1}{M} \sum_{i=1}^M f_{\theta_{\{i\}}}^{\{ i\}}(x), \qquad \text{where} \qquad \theta = (\theta_{\{1\}},\ldots,\theta_{\{M\}}). \]
\(f_\theta(\cdot)\) is more computationally costly since it requires \(M\) models instead of a single model.
Nevertheless, there are benefits.
Assume the models are \(\color{Red}{\textrm{homogeneous}}\) in nature and only differ due to randomness in the training process and not the model choice or hyper-parameters.
For some \(\color{Blue}{\textrm{fixed unseen input }\tilde{x}}\) we may treat the output of model \(i\), denoted \(\hat{y}^{\{i\}}_{\theta_{\{i\}}}(\tilde{x})\), as a \(\color{Red}{\textrm{random variable that is identical in distribution to every other model output } \hat{y}^{\{j\}}_{\theta_{\{j\}}}(\tilde{x})}\), yet generally not independent.
We further assume that any pair of model outputs is \(\color{Red}{\textrm{identically distributed to any other pair}}\): \[ \mathbb{E}\big[\hat{y}_{\theta_{\{i\}}}^{\{i\}}(\tilde{x})\big] = \mu, \qquad \textrm{Var}\big(\hat{y}_{\theta_{\{i\}}}^{\{i\}}(\tilde{x})\big) = \sigma^2, \qquad \text{and} \qquad \text{cor}\big(\hat{y}_{\theta_{\{i\}}}^{\{i\}}(\tilde{x}), \hat{y}_{\theta_{\{j\}}}^{\{j\}}(\tilde{x})\big) = \rho, \] where \(\text{cor}(\cdot, \cdot)\) is the correlation between two models \(i \neq j\) and is assumed to be the same for all \(i\), \(j\) pairs.
Note that for such a common correlation \(\rho\) to be feasible (yielding a valid covariance matrix), we must have
\[ -\frac{1}{M-1} \le \rho, \qquad \text{or equivalently,} \qquad 0 \le \rho + \frac{1-\rho}{M}. \]
The expectation of the ensemble output is
\[ \mathbb{E}[f_{\theta}(\tilde{x})] = \frac{1}{M} \mathbb{E} \big[\sum_{i=1}^M f_{\theta_{\{i\}}}^{\{i\}}(\tilde{x}) \big] = \mu, \]
and further noting that \(\rho \sigma^2\) is the \(\color{Blue}{\textrm{covariance between any two models}}\) we obtain
\[ \textrm{Var}\big( f_{\theta}(\tilde{x}) \big) = \frac{1}{M^2} \textrm{Var}\Big(\sum_{i=1}^M f_{\theta_{\{i\}}}^{\{i\}}(\tilde{x})\Big) = \frac{1}{M^2} \big(M \sigma^2 + M(M-1) \rho \sigma^2 \big) = \Big( \rho + \frac{1-\rho}{M}\Big)\sigma^2. \]
As the number of models in the ensemble, \(M\), grows, the variance of the ensemble model \(\color{Blue}{\textrm{converges}}\) to \(\rho \sigma^2\).
Since \(\rho \le 1\) and practically \(\rho < 1\), this \(\color{Blue}{\textrm{limiting variance}}\) is less than \(\sigma^2\).
For example, if \(\rho = 0.5\), then as \(M\) grows the \(\color{Red}{\textrm{variance of the estimator drops}}\) by up to \(50\%\) relative to a single model.
These properties of ensemble models make them \(\color{Blue}{\textrm{very attractive}}\) because the bias does not change while the variance decreases.
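The variance formula can be checked numerically. The simulation below is illustrative (not from the book): it generates correlated model outputs with \(\sigma^2 = 1\) and \(\rho = 0.5\) by mixing a shared random component with an independent one per model:

```python
import numpy as np

# Numerical check of Var(f_theta) = (rho + (1 - rho)/M) * sigma**2,
# here with sigma**2 = 1 and rho = 0.5, so the limit value is 0.55 for M = 10.
rng = np.random.default_rng(3)
M, rho, n_rep = 10, 0.5, 200_000

# Correlated outputs: a shared component (variance rho) plus an
# independent component (variance 1 - rho) for each of the M models.
shared = rng.normal(0, np.sqrt(rho), size=(n_rep, 1))
indiv = rng.normal(0, np.sqrt(1 - rho), size=(n_rep, M))
outputs = shared + indiv            # each column: one model's output

ensemble = outputs.mean(axis=1)     # the average of the M models
print(ensemble.var())               # ≈ rho + (1 - rho)/M = 0.55
```

The empirical variance of the averaged output matches \(\rho + (1-\rho)/M\), well below the single-model variance of \(1\).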
Nevertheless, deep learning models \(\color{Red}{\textrm{are not easily amenable}}\) for ensemble models because the number of parameters and computational cost (both for training and production) is too high.
Training a single model may sometimes \(\color{Blue}{\textrm{take days, and the computational cost of a single evaluation } f_{\theta_{\{i\}}}^{\{i\}}(\tilde{x})}\) is also non-negligible.
This is where \(\color{Red}{\textrm{dropout comes in}}\).
We may \(\color{Red}{\textrm{loosely view dropout as an ensemble of } M \textrm{ models}}\) where \(M\) is the number of training iterations.
It has been empirically shown that this is a good approximation of the average of all ensemble members.
Addition of a regularization term is another key approach to prevent overfitting and improve generalization performance.
Augmenting the loss with a regularization term \(R_\lambda(\theta)\) restricts the flexibility of the model, and this restriction is sometimes needed to prevent over-fitting.
In the context of deep learning, and especially when ridge regression style regularization is applied, this practice is sometimes called weight decay when considering gradient based optimization.
Take the original loss function \(C(\theta)\) and augment it to be \(\tilde{C}(\theta) = C(\theta) + R_\lambda(\theta)\).
\[ R_\lambda(\theta) = \frac{\lambda}{2} R(\theta), \qquad \text{with} \qquad R(\theta) = \| \theta \|^2 = \theta_1^2 + \ldots + \theta_d^2. \]
Here, for notational simplicity, we consider all \(d\) parameters of the model as scalars, \(\theta_i\) for \(i=1,\ldots,d\).
Now for simplicity, assume we execute basic gradient descent steps.
With a learning rate \(\alpha > 0\), the update at iteration \(t\) is,
\[ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla \tilde{C}(\theta^{(t)}). \]
In our ridge regression style penalty case we have \(\nabla \tilde{C}(\theta) = \nabla {C}(\theta) + \lambda \theta\), and hence the gradient descent update can be represented as
\[\begin{equation} \label{eq:weight-decay} \theta^{(t+1)} = (1- \alpha \lambda) \theta^{(t)} - \alpha \nabla {C}(\theta^{(t)}). \end{equation}\]
Assuming that \(\alpha \lambda < 2\), the factor \((1- \alpha \lambda)\) has magnitude less than \(1\), so the update involves shrinkage, or weight decay, applied directly to the parameters in addition to gradient based learning.
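The weight-decay update \eqref{eq:weight-decay} can be sketched on a toy problem. Below, the loss \(C(\theta) = \frac{1}{2}\|\theta - c\|^2\) and its minimizer \(c\) are illustrative choices of ours; the penalized optimum is \(c/(1+\lambda)\), shrunk toward zero:

```python
import numpy as np

# Illustrative quadratic loss C(theta) = 0.5 * ||theta - c||^2,
# whose unpenalized minimizer is c.
c = np.array([3.0, -2.0])
grad_C = lambda theta: theta - c      # gradient of the unpenalized loss

alpha, lam = 0.1, 0.5                 # learning rate, penalty strength
theta = np.zeros(2)
for _ in range(200):
    # The weight-decay update: shrink, then take a gradient step.
    theta = (1 - alpha * lam) * theta - alpha * grad_C(theta)

# Converges to the ridge-penalized minimum c / (1 + lam).
print(theta)   # ≈ [2.0, -1.333...]
```

Setting \(\lambda = 0\) recovers plain gradient descent and convergence to \(c\) itself.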
Practice 4: In this practice, we mainly use the MNIST dataset to explore classification deep neural network (DNN) models. At the end of this practice, you should be comfortable using a software package (here Keras) to run different models for a classification task. You will explore different models by exploring/tuning different hyperparameters of the DNN:
Deep Neural Network: Practice 4 with Python (Google Colab)
Tutorial on MNIST data using R code (might be slow)